Rafiki: A Middleware for Parameter Tuning of NoSQL Datastores for Dynamic Metagenomics Workloads

Abstract

High performance computing (HPC) applications, such as metagenomics and other big data systems, need to store and analyze huge volumes of semi-structured data. Such applications often rely on NoSQL-based datastores, and optimizing these databases is a challenging endeavor, with over 50 configuration parameters in Cassandra alone. As the application executes, database workloads can change rapidly from read-heavy to write-heavy ones, and a system tuned with a read-optimized configuration becomes suboptimal when the workload becomes write-heavy. In this paper, we present a method and a system for optimizing NoSQL configurations for Cassandra and ScyllaDB when running HPC and metagenomics workloads. First, we identify the significance of configuration parameters using ANOVA. Next, we apply neural networks using the most significant parameters and their workload-dependent mapping to predict database throughput, as a surrogate model. Then, we optimize the configuration using genetic algorithms on the surrogate to maximize the workload-dependent performance. Using the proposed methodology in our system (Rafiki), we can predict the throughput for unseen workloads and configuration values with an error of 7.5% for Cassandra and 6.9-7.8% for ScyllaDB. Searching the configuration spaces using the trained surrogate models, we achieve performance improvements of 41% for Cassandra and 9% for ScyllaDB over the default configuration with respect to a read-heavy workload, and also significant improvement for mixed workloads. In terms of searching speed, Rafiki, using only 1/1000th of the searching time of exhaustive search, reaches within 15% and 9.5% of the theoretically best achievable performances for Cassandra and ScyllaDB, respectively, supporting optimizations for highly dynamic workloads.

ACM Reference format: Elided. 2016. Rafiki: A Middleware for Parameter Tuning of NoSQL Datastores for Dynamic Metagenomics Workloads. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference’17), 13 pages. DOI: 10.1145/nnnnnnn.nnnnnnn
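
The last stage the abstract describes, a genetic algorithm searching the configuration space against a surrogate model, can be illustrated with a minimal sketch. This is not Rafiki's implementation. The parameter names (`write_threads`, `cache_mb`), their bounds, the default values, and the analytic `surrogate_throughput` function are all hypothetical stand-ins; in the paper the surrogate is a trained neural network over ANOVA-selected parameters.

```python
import random

# Hypothetical tunable parameters and integer bounds (not taken from the paper).
BOUNDS = {
    "write_threads": (8, 128),
    "cache_mb": (64, 2048),
}

def surrogate_throughput(config, read_ratio):
    """Toy stand-in for a trained surrogate: predicted ops/s for a workload.

    The shape is invented for illustration: the best value of each
    parameter shifts with the fraction of reads in the workload.
    """
    best_threads = 32 + 64 * (1.0 - read_ratio)   # write-heavy favors more threads
    best_cache = 256 + 1536 * read_ratio          # read-heavy favors more cache
    penalty = ((config["write_threads"] - best_threads) / 120) ** 2
    penalty += ((config["cache_mb"] - best_cache) / 1984) ** 2
    return 100_000 * (1.0 - penalty)

def random_config(rng):
    return {k: rng.randint(lo, hi) for k, (lo, hi) in BOUNDS.items()}

def crossover(a, b, rng):
    # Uniform crossover: each parameter is inherited from one parent.
    return {k: rng.choice((a[k], b[k])) for k in BOUNDS}

def mutate(config, rng, rate=0.3):
    # Nudge each parameter with probability `rate`, clamped to its bounds.
    child = dict(config)
    for k, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            step = rng.randint(-(hi - lo) // 10, (hi - lo) // 10)
            child[k] = min(hi, max(lo, child[k] + step))
    return child

def genetic_search(read_ratio, seed_config, generations=40, pop_size=30, seed=0):
    rng = random.Random(seed)

    def fitness(c):
        return surrogate_throughput(c, read_ratio)

    # Start from the default configuration plus random candidates.
    population = [dict(seed_config)]
    population += [random_config(rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 5]  # top 20% survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = elite + children
    return max(population, key=fitness)

# Hypothetical default configuration, used as the baseline to beat.
DEFAULT = {"write_threads": 32, "cache_mb": 512}
best = genetic_search(read_ratio=0.9, seed_config=DEFAULT)
```

Because the default configuration is seeded into the first generation and the elite carries over unchanged, the returned configuration never scores below the default on the surrogate. Each fitness evaluation is a cheap model call rather than a database benchmark run, which is why a search of this kind can be far faster than exhaustively benchmarking configurations.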


Similar resources

Benchmarking Replication in Cassandra and MongoDB NoSQL Datastores

The proliferation in Web 2.0 applications has increased the volume, velocity, and variety of data sources which have exceeded the limitations and expected use cases of traditional relational DBMSs. Cloud serving NoSQL data stores address these concerns and provide replication mechanisms to ensure fault tolerance, high availability, and improved scalability. In this paper, we empirically explore...


A Comparison of Data Models and APIs of NoSQL Datastores

NoSQL datastore systems are a new generation of non-relational databases. More than fifty NoSQL systems have been already implemented, each with different characteristics — especially, with different data models and different APIs to access the data. In this paper we describe and compare the data models and operations offered by a number of representative NoSQL datastores, which we have directl...


RangeMerge: Online Performance Tradeoffs in NoSQL Datastores

Datastores are distributed systems that manage enormous amounts of structured data for online serving and batch processing applications. The NoSQL datastores weaken the traditional relational and transactional model in favor of horizontal scalability. They usually support concurrent operations with demanding throughput and latency requirements which may vary across different workload types. A t...


A Simple Approach for Executing SQL on a NoSQL Datastore

NoSQL datastores have been initially introduced to support a few concrete extreme scale applications. Limited query and indexing capabilities were therefore not a major impediment, as the specificity and scale of the target application justified the investment in manually crafting application code. With a number of alternatives now available and mature, there is an increasing willingness to use...


Access control in ultra-large-scale systems using a data-centric middleware

The primary characteristic of an Ultra-Large-Scale (ULS) system is ultra-large size on any related dimension. A ULS system is generally considered as a system-of-systems with heterogeneous nodes and autonomous domains. As the size of a system-of-systems grows, and interoperability demand between sub-systems is increased, achieving more scalable and dynamic access control system becomes an im...




Publication date: 2017